{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pinecone Canopy library quick start notebook\n",
    "\n",
    "**Canopy** is a Software Development Kit (SDK) for AI applications. Canopy allows you to test, build, and package Retrieval Augmented Generation (RAG) applications with the Pinecone vector database.\n",
    "\n",
    "This notebook introduces the quick start steps for working with the Canopy library. You can find more details about this project and advanced usage in the project [documentation](../README.md).\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "Install the Canopy library:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
   ],
   "source": [
    "!pip install -qU canopy-sdk"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, Canopy uses Pinecone and OpenAI, so we need to configure the related API keys.\n",
    "\n",
    "To get a free-trial Pinecone API key and environment, register or log in to your Pinecone account in the [console](https://app.pinecone.io/). You can access your API key from the \"API Keys\" section in the sidebar of your dashboard, and find the environment name next to it.\n",
    "\n",
    "You can find your free-trial OpenAI API key [here](https://platform.openai.com/account/api-keys). You might need to log in to or register for OpenAI services.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# Reuse keys already set in the environment; otherwise fall back to placeholders that you must replace\n",
    "os.environ[\"PINECONE_API_KEY\"] = os.environ.get('PINECONE_API_KEY') or 'YOUR_PINECONE_API_KEY'\n",
    "os.environ[\"PINECONE_ENVIRONMENT\"] = os.environ.get('PINECONE_ENVIRONMENT') or 'YOUR_PINECONE_ENVIRONMENT'\n",
    "os.environ[\"OPENAI_API_KEY\"] = os.environ.get('OPENAI_API_KEY') or 'YOUR_OPENAI_API_KEY'"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pinecone Documentation Dataset"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we'll load a crawl of the Pinecone docs [website](https://docs.pinecone.io/docs/), taken on October 25, 2023.\n",
    "\n",
    "We will use this data to demonstrate how to build a RAG pipeline that answers questions about the Pinecone database."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>text</th>\n",
       "      <th>source</th>\n",
       "      <th>metadata</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>728aeea1-1dcf-5d0a-91f2-ecccd4dd4272</td>\n",
       "      <td># Scale indexes\\n\\n[Suggest Edits](/edit/scali...</td>\n",
       "      <td>https://docs.pinecone.io/docs/scaling-indexes</td>\n",
       "      <td>{'created_at': '2023_10_25', 'title': 'scaling...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2f19f269-171f-5556-93f3-a2d7eabbe50f</td>\n",
       "      <td># Understanding organizations\\n\\n[Suggest Edit...</td>\n",
       "      <td>https://docs.pinecone.io/docs/organizations</td>\n",
       "      <td>{'created_at': '2023_10_25', 'title': 'organiz...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>b2a71cb3-5148-5090-86d5-7f4156edd7cf</td>\n",
       "      <td># Manage datasets\\n\\n[Suggest Edits](/edit/dat...</td>\n",
       "      <td>https://docs.pinecone.io/docs/datasets</td>\n",
       "      <td>{'created_at': '2023_10_25', 'title': 'datasets'}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1dafe68a-2e78-57f7-a97a-93e043462196</td>\n",
       "      <td># Architecture\\n\\n[Suggest Edits](/edit/archit...</td>\n",
       "      <td>https://docs.pinecone.io/docs/architecture</td>\n",
       "      <td>{'created_at': '2023_10_25', 'title': 'archite...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>8b07b24d-4ec2-58a1-ac91-c8e6267b9ffd</td>\n",
       "      <td># Moving to production\\n\\n[Suggest Edits](/edi...</td>\n",
       "      <td>https://docs.pinecone.io/docs/moving-to-produc...</td>\n",
       "      <td>{'created_at': '2023_10_25', 'title': 'moving-...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                     id  \\\n",
       "0  728aeea1-1dcf-5d0a-91f2-ecccd4dd4272   \n",
       "1  2f19f269-171f-5556-93f3-a2d7eabbe50f   \n",
       "2  b2a71cb3-5148-5090-86d5-7f4156edd7cf   \n",
       "3  1dafe68a-2e78-57f7-a97a-93e043462196   \n",
       "4  8b07b24d-4ec2-58a1-ac91-c8e6267b9ffd   \n",
       "\n",
       "                                                text  \\\n",
       "0  # Scale indexes\\n\\n[Suggest Edits](/edit/scali...   \n",
       "1  # Understanding organizations\\n\\n[Suggest Edit...   \n",
       "2  # Manage datasets\\n\\n[Suggest Edits](/edit/dat...   \n",
       "3  # Architecture\\n\\n[Suggest Edits](/edit/archit...   \n",
       "4  # Moving to production\\n\\n[Suggest Edits](/edi...   \n",
       "\n",
       "                                              source  \\\n",
       "0      https://docs.pinecone.io/docs/scaling-indexes   \n",
       "1        https://docs.pinecone.io/docs/organizations   \n",
       "2             https://docs.pinecone.io/docs/datasets   \n",
       "3         https://docs.pinecone.io/docs/architecture   \n",
       "4  https://docs.pinecone.io/docs/moving-to-produc...   \n",
       "\n",
       "                                            metadata  \n",
       "0  {'created_at': '2023_10_25', 'title': 'scaling...  \n",
       "1  {'created_at': '2023_10_25', 'title': 'organiz...  \n",
       "2  {'created_at': '2023_10_25', 'title': 'datasets'}  \n",
       "3  {'created_at': '2023_10_25', 'title': 'archite...  \n",
       "4  {'created_at': '2023_10_25', 'title': 'moving-...  "
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "data = pd.read_parquet(\"https://storage.googleapis.com/pinecone-datasets-dev/pinecone_docs_ada-002/raw/file1.parquet\")\n",
    "data.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each record in this dataset represents a single page in Pinecone's documentation. Each row contains a unique ID, the raw text of the page in Markdown, the URL of the page as \"source\", and some metadata."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Init a Tokenizer\n",
    "\n",
    "\n",
    "Many of Canopy's components use tokenization, a process that splits text into tokens: basic units of text (such as words or sub-words) that are used for processing. Canopy therefore uses a singleton `Tokenizer` object, which needs to be initialized once."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "from canopy.tokenizer import Tokenizer\n",
    "Tokenizer.initialize()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After initializing the global object, we can simply create an instance from anywhere in our code, without providing any parameters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Hello', ' world', '!']"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from canopy.tokenizer import Tokenizer\n",
    "\n",
    "tokenizer = Tokenizer()\n",
    "\n",
    "tokenizer.tokenize(\"Hello world!\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating a KnowledgeBase to store our data for search\n",
    "\n",
    "The `KnowledgeBase` object is responsible for storing and indexing textual documents.\n",
    "\n",
    "Once documents are indexed, the `KnowledgeBase` can be queried with a new, unseen text passage, for which the most relevant document chunks are retrieved.\n",
    "\n",
    "The `KnowledgeBase` holds a connection to a Pinecone index and provides a simple API to insert, delete, and search textual documents.\n",
    "\n",
    "The `KnowledgeBase`'s `upsert()` operation is used to index new documents or update already stored documents. The `upsert` process splits each document's text into smaller chunks, transforms these chunks into vector embeddings, then upserts those vectors to the underlying Pinecone index. At query time, the `KnowledgeBase` transforms the query text into a vector in a similar manner, then queries the underlying Pinecone index to retrieve the top-k most closely matched document chunks.\n",
    "\n",
    "Here we create a `KnowledgeBase` with our desired index name:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "from canopy.knowledge_base import KnowledgeBase\n",
    "\n",
    "INDEX_NAME = \"my-index\"\n",
    "\n",
    "kb = KnowledgeBase(index_name=INDEX_NAME)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "During the one-time setup of a new Canopy service, an underlying Pinecone index needs to be created. If you have already created a Canopy-enabled Pinecone index, you can skip this step.\n",
    "\n",
    "Note: Since Canopy uses a dedicated data schema, it is not recommended to use a pre-existing Pinecone index that wasn't created by Canopy's `create_canopy_index()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "from canopy.knowledge_base import list_canopy_indexes\n",
    "\n",
    "if not any(name.endswith(INDEX_NAME) for name in list_canopy_indexes()):\n",
    "    kb.create_canopy_index(indexed_fields=[\"title\"])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see the created index in Pinecone's [console](https://app.pinecone.io/).\n",
    "\n",
    "The next time we want to initialize a knowledge base instance for this index, we can simply call the `connect()` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "kb = KnowledgeBase(index_name=INDEX_NAME)\n",
    "kb.connect()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 💡 Note: a knowledge base must be connected to an index before executing any operation. Call `kb.connect()` to connect to an existing index, or call `kb.create_canopy_index()` to create a new one, before calling any other method of the knowledge base."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Upsert data to our KnowledgeBase"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we need to convert our dataset to a list of `Document` objects.\n",
    "\n",
    "Each document object can hold id, text, source and metadata:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "from canopy.models.data_models import Document\n",
    "\n",
    "example_docs = [Document(id=\"1\",\n",
    "                         text=\"This is text for example\",\n",
    "                         source=\"https://url.com\"),\n",
    "                Document(id=\"2\",\n",
    "                         text=\"this is another text\",\n",
    "                         source=\"https://another-url.com\",\n",
    "                         metadata={\"my-key\": \"my-value\"})]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data in our example dataset is already provided in this schema, so we can simply iterate over it and instantiate `Document` objects:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents = [Document(**row) for _, row in data.iterrows()]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are ready to upsert our data, with only a single command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "a80eb9ab18ef4a10b104ab9af8e208ef",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/6 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from tqdm.auto import tqdm\n",
    "\n",
    "batch_size = 10\n",
    "\n",
    "for i in tqdm(range(0, len(documents), batch_size)):\n",
    "    kb.upsert(documents[i: i+batch_size])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Internally, the `KnowledgeBase` handles all the processing needed to index the documents. Each document's text is chunked into smaller pieces and encoded into vector embeddings that can then be upserted directly to Pinecone. Later in this notebook we'll learn how to tune and customize this process."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Query the KnowledgeBase\n",
    "\n",
    "Now we can query the knowledge base. The KnowledgeBase will use its default parameters like `top_k` to execute the query:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "def print_query_results(results):\n",
    "    for query_results in results:\n",
    "        print('query: ' + query_results.query + '\\n')\n",
    "        for document in query_results.documents:\n",
    "            print('document: ' + document.text.replace(\"\\n\", \"\\\\n\"))\n",
    "            print(\"title: \" + document.metadata[\"title\"])\n",
    "            print('source: ' + document.source)\n",
    "            print(f\"score: {document.score}\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "query: p1 pod capacity\n",
      "\n",
      "document: ### s1 pods\\n\\n\\nThese storage-optimized pods provide large storage capacity and lower overall costs with slightly higher query latencies than p1 pods. They are ideal for very large indexes with moderate or relaxed latency requirements.\\n\\n\\nEach s1 pod has enough capacity for around 5M vectors of 768 dimensions.\\n\\n\\n### p1 pods\\n\\n\\nThese performance-optimized pods provide very low query latencies, but hold fewer vectors per pod than s1 pods. They are ideal for applications with low latency requirements (<100ms).\\n\\n\\nEach p1 pod has enough capacity for around 1M vectors of 768 dimensions.\n",
      "title: indexes\n",
      "source: https://docs.pinecone.io/docs/indexes\n",
      "score: 0.844001234\n",
      "\n",
      "document: ## Pod storage capacity\\n\\n\\nEach **p1** pod has enough capacity for 1M vectors with 768 dimensions.\\n\\n\\nEach **s1** pod has enough capacity for 5M vectors with 768 dimensions.\\n\\n\\n## Metadata\\n\\n\\nMax metadata size per vector is 40 KB.\\n\\n\\nNull metadata values are not supported. Instead of setting a key to hold a null value, we recommend you remove that key from the metadata payload.\\n\\n\\nMetadata with high cardinality, such as a unique value for every vector in a large index, uses more memory than expected and can cause the pods to become full.\n",
      "title: limits\n",
      "source: https://docs.pinecone.io/docs/limits\n",
      "score: 0.842709482\n",
      "\n",
      "document: #### p2 pod type (Public Preview)(\"Beta\")\\n\\n\\nThe new [p2 pod type](indexes/#p2-pods) provides search speeds of around 5ms and throughput of 200 queries per second per replica, or approximately 10x faster speeds and higher throughput than the p1 pod type, depending on your data and network conditions. \\n\\n\\nThis is a **public preview** feature and is not appropriate for production workloads.\\n\\n\\n#### Improved p1 and s1 performance\\n\\n\\nThe [s1](indexes/#s1-pods) and [p1](indexes/#p1-pods) pod types now offer approximately 50% higher query throughput and 50% lower latency, depending on your workload.\n",
      "title: release-notes\n",
      "source: https://docs.pinecone.io/docs/release-notes\n",
      "score: 0.834972441\n",
      "\n",
      "document: ### p2 pods\\n\\n\\nThe p2 pod type provides greater query throughput with lower latency. For vectors with fewer than 128 dimension and queries where `topK` is less than 50, p2 pods support up to 200 QPS per replica and return queries in less than 10ms. This means that query throughput and latency are better than s1 and p1.\\n\\n\\nEach p2 pod has enough capacity for around 1M vectors of 768 dimensions. However, capacity may vary with dimensionality.\\n\\n\\nThe data ingestion rate for p2 pods is significantly slower than for p1 pods; this rate decreases as the number of dimensions increases. For example, a p2 pod containing vectors with 128 dimensions can upsert up to 300 updates per second; a p2 pod containing vectors with 768 dimensions or more supports upsert of 50 updates per second. Because query latency and throughput for p2 pods vary from p1 pods, test p2 pod performance with your dataset.\\n\\n\\nThe p2 pod type does not support sparse vector values.\n",
      "title: indexes\n",
      "source: https://docs.pinecone.io/docs/indexes\n",
      "score: 0.832246363\n",
      "\n",
      "document: ## Number of vectors\\n\\n\\nThe most important consideration in sizing is the [number of vectors](/docs/insert-data/) you plan on working with. As a rule of thumb, a single p1 pod can store approximately 1M vectors, while a s1 pod can store 5M vectors. However, this can be affected by other factors, such as dimensionality and metadata, which are explained below.\n",
      "title: choosing-index-type-and-size\n",
      "source: https://docs.pinecone.io/docs/choosing-index-type-and-size\n",
      "score: 0.828785\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from canopy.models.data_models import Query\n",
    "results = kb.query([Query(text=\"p1 pod capacity\")])\n",
    "\n",
    "print_query_results(results)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can set the `top_k` parameter to control how many of the top query results are returned, and you can also provide a [metadata filter](https://docs.pinecone.io/docs/metadata-filtering) to restrict which documents are searched."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "query: p1 pod capacity\n",
      "\n",
      "document: ## Pod storage capacity\\n\\n\\nEach **p1** pod has enough capacity for 1M vectors with 768 dimensions.\\n\\n\\nEach **s1** pod has enough capacity for 5M vectors with 768 dimensions.\\n\\n\\n## Metadata\\n\\n\\nMax metadata size per vector is 40 KB.\\n\\n\\nNull metadata values are not supported. Instead of setting a key to hold a null value, we recommend you remove that key from the metadata payload.\\n\\n\\nMetadata with high cardinality, such as a unique value for every vector in a large index, uses more memory than expected and can cause the pods to become full.\n",
      "title: limits\n",
      "source: https://docs.pinecone.io/docs/limits\n",
      "score: 0.842464507\n",
      "\n",
      "document: ## Retention\\n\\n\\nIn general, indexes on the Starter (free) plan are archived as collections and deleted after 7 days of inactivity; for indexes created by certain open source projects such as AutoGPT, indexes are archived and deleted after 1 day of inactivity. To prevent this, you can send any API request to Pinecone and the counter will reset.\\n\\nUpdated about 1 month ago \\n\\n\\n\\n---\\n\\n* [Table of Contents](#)\\n* + [Upserts](#upserts)\\n\t+ [Queries](#queries)\\n\t+ [Fetch and Delete](#fetch-and-delete)\\n\t+ [Namespaces](#namespaces)\\n\t+ [Pod storage capacity](#pod-storage-capacity)\\n\t+ [Metadata](#metadata)\\n\t+ [Retention](#retention)\n",
      "title: limits\n",
      "source: https://docs.pinecone.io/docs/limits\n",
      "score: 0.71726948\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from canopy.models.data_models import Query\n",
    "results = kb.query([Query(text=\"p1 pod capacity\",\n",
    "                          metadata_filter={\"source\": \"https://docs.pinecone.io/docs/limits\"},\n",
    "                          top_k=2)])\n",
    "\n",
    "print_query_results(results)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see above, with the metadata filter we get results only from the \"limits\" page."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Query the Context Engine\n",
    "\n",
    "`ContextEngine` is an object responsible for retrieving the most relevant context for a given query and token budget.  \n",
    "\n",
    "While `KnowledgeBase` retrieves the full `top_k` structured documents for each query, including all the metadata related to them, the context engine is in charge of transforming this information into a \"prompt-ready\" context that can later be fed to an LLM. To achieve this, the context engine holds a `ContextBuilder` object that takes query results from the knowledge base and returns a `Context` object. The `ContextEngine`'s default behavior is to use a `StuffingContextBuilder`, which simply stacks retrieved document chunks in a JSON-like manner, hard-limited to the number of chunks that fit the `max_context_tokens` budget. More complex behaviors can be achieved by providing a custom `ContextBuilder` class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "from canopy.context_engine import ContextEngine\n",
    "context_engine = ContextEngine(kb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"query\": \"capacity of p1 pods\",\n",
      "  \"snippets\": [\n",
      "    {\n",
      "      \"source\": \"https://docs.pinecone.io/docs/indexes\",\n",
      "      \"text\": \"### s1 pods\\n\\n\\nThese storage-optimized pods provide large storage capacity and lower overall costs with slightly higher query latencies than p1 pods. They are ideal for very large indexes with moderate or relaxed latency requirements.\\n\\n\\nEach s1 pod has enough capacity for around 5M vectors of 768 dimensions.\\n\\n\\n### p1 pods\\n\\n\\nThese performance-optimized pods provide very low query latencies, but hold fewer vectors per pod than s1 pods. They are ideal for applications with low latency requirements (<100ms).\\n\\n\\nEach p1 pod has enough capacity for around 1M vectors of 768 dimensions.\"\n",
      "    },\n",
      "    {\n",
      "      \"source\": \"https://docs.pinecone.io/docs/indexes\",\n",
      "      \"text\": \"### p2 pods\\n\\n\\nThe p2 pod type provides greater query throughput with lower latency. For vectors with fewer than 128 dimension and queries where `topK` is less than 50, p2 pods support up to 200 QPS per replica and return queries in less than 10ms. This means that query throughput and latency are better than s1 and p1.\\n\\n\\nEach p2 pod has enough capacity for around 1M vectors of 768 dimensions. However, capacity may vary with dimensionality.\\n\\n\\nThe data ingestion rate for p2 pods is significantly slower than for p1 pods; this rate decreases as the number of dimensions increases. For example, a p2 pod containing vectors with 128 dimensions can upsert up to 300 updates per second; a p2 pod containing vectors with 768 dimensions or more supports upsert of 50 updates per second. Because query latency and throughput for p2 pods vary from p1 pods, test p2 pod performance with your dataset.\\n\\n\\nThe p2 pod type does not support sparse vector values.\"\n",
      "    }\n",
      "  ]\n",
      "}\n",
      "\n",
      "# tokens in context returned: 412\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "\n",
    "result = context_engine.query([Query(text=\"capacity of p1 pods\", top_k=5)], max_context_tokens=512)\n",
    "\n",
    "print(result.to_text(indent=2))\n",
    "print(f\"\\n# tokens in context returned: {result.num_tokens}\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see above, although we set `top_k=5`, the context engine retrieved only 2 results in order to satisfy the 512-token limit. Also, the documents in the context contain only the text and source, without the rest of the metadata, which the LLM does not necessarily need."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Knowledgeable chat engine\n",
    "\n",
    "Now we are ready to start chatting with our data!\n",
    "\n",
    "Canopy's `ChatEngine` is a one-stop-shop RAG-infused chatbot. The `ChatEngine` wraps an underlying LLM such as OpenAI's ChatGPT, enhancing it by providing relevant context from the user's knowledge base. It also automatically phrases search queries out of the chat history and sends them to the knowledge base."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "from canopy.chat_engine import ChatEngine\n",
    "chat_engine = ChatEngine(context_engine)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Tuple\n",
    "from canopy.models.data_models import Messages, UserMessage, AssistantMessage\n",
    "\n",
    "def chat(new_message: str, history: Messages) -> Tuple[str, Messages]:\n",
    "    messages = history + [UserMessage(content=new_message)]\n",
    "    response = chat_engine.chat(messages)\n",
    "    assistant_response = response.choices[0].message.content\n",
    "    return assistant_response, messages + [AssistantMessage(content=assistant_response)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "The capacity of p1 pods is enough for around 1 million vectors of 768 dimensions. Source: [Pinecone Documentation](https://docs.pinecone.io/docs/indexes)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from IPython.display import display, Markdown\n",
    "\n",
    "history = []\n",
    "response, history = chat(\"What is the capacity of p1 pods?\", history)\n",
    "display(Markdown(response))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "P1 pods are ideal for applications with low latency requirements, specifically those that require latencies of less than 100 milliseconds. Source: [Pinecone Documentation](https://docs.pinecone.io/docs/indexes)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "response, history = chat(\"And for what latency requirements does it fit?\", history)\n",
    "display(Markdown(response))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 💡 Note: Canopy calls the underlying LLM, providing both the user-provided chat history and a generated `Context` prompt. Together these might exceed the `ChatEngine`'s configured `max_prompt_tokens`. By default, the `ChatEngine` truncates the oldest messages in the chat history to avoid exceeding this limit. This behavior is configurable, as explained in the [documentation](https://github.com/pinecone-io/canopy/blob/main/src/canopy/chat_engine/chat_engine.py)."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Customization Example\n",
    "\n",
    "Canopy is built as a modular library in which each component can be fully customized by the user.\n",
    "\n",
    "Before we start, here is a quick overview of the inner components used by the knowledge base:\n",
    "\n",
    "- **Index**: A Pinecone index that holds the vector representations of the documents.\n",
    "- **Chunker**: A `Chunker` object that is used to chunk the documents into smaller pieces of text.\n",
    "- **Encoder**: A `RecordEncoder` object that is used to encode the chunks and queries into vector representations.\n",
    "\n",
    "In the following example, we show how you can customize the `Chunker` component used by the knowledge base.\n",
    "\n",
    "First, we will create a dummy chunker class that simply splits the text on newlines (`\\n`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List\n",
    "from canopy.knowledge_base.chunker.base import Chunker\n",
    "from canopy.knowledge_base.models import KBDocChunk\n",
    "\n",
    "class NewLineChunker(Chunker):\n",
    "\n",
    "    def chunk_single_document(self, document: Document) -> List[KBDocChunk]:\n",
    "        # Split the document text into one chunk per line\n",
    "        line_chunks = document.text.split(\"\\n\")\n",
    "        # Each chunk gets its own id and inherits the document's source and metadata\n",
    "        return [KBDocChunk(id=f\"{document.id}_{i}\",\n",
    "                           document_id=document.id,\n",
    "                           text=text_chunk,\n",
    "                           source=document.source,\n",
    "                           metadata=document.metadata)\n",
    "                for i, text_chunk in enumerate(line_chunks)]\n",
    "\n",
    "    async def achunk_single_document(self, document: Document) -> List[KBDocChunk]:\n",
    "        # The async variant is not needed for this example\n",
    "        raise NotImplementedError()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[KBDocChunk(id='id1_0', text='This is first line', source='example', metadata={'title': 'newline'}, document_id='id1'),\n",
       " KBDocChunk(id='id1_1', text='This is the second line', source='example', metadata={'title': 'newline'}, document_id='id1')]"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chunker = NewLineChunker()\n",
    "\n",
    "document = Document(id=\"id1\",\n",
    "                    text=\"This is first line\\nThis is the second line\",\n",
    "                    source=\"example\",\n",
    "                    metadata={\"title\": \"newline\"})\n",
    "chunker.chunk_single_document(document)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can initialize a new knowledge base to use our new chunker:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": [
    "kb = KnowledgeBase(index_name=INDEX_NAME,\n",
    "                   chunker=chunker)\n",
    "kb.connect()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And upsert our example document:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [],
   "source": [
    "kb.upsert([document])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "query: second line\n",
      "\n",
      "document: This is the second line\n",
      "title: newline\n",
      "source: example\n",
      "score: 0.928711653\n",
      "\n",
      "document: This is first line\n",
      "title: newline\n",
      "source: example\n",
      "score: 0.887627542\n",
      "\n"
     ]
    }
   ],
   "source": [
    "results = kb.query([Query(text=\"second line\",\n",
    "                          metadata_filter={\"title\": \"newline\"})])\n",
    "\n",
    "print_query_results(results)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see above, our knowledge base split the document on newlines, as expected."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Delete the index once you are sure that you do not want to use it anymore. Once the index is deleted, you cannot use it again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [],
   "source": [
    "kb.delete_index()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "canopy-quick",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "9e9b81017be88d4d093a2a92984a986685ce96a6b6736b12c233fdf6b743e185"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
